{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 构建同学兴趣爱好图数据库\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "先对数据进行建模: \n",
    "\n",
    "- 实体: 同学, 爱好\n",
    "- 关系: \"喜欢做\"\n",
    "\n",
    "因此, 图数据库中应包含 \"同学\" 结点, 以及 \"爱好\" 结点, 其间通过 \"喜欢做\" 进行关联.\n",
    "\n",
    "```cypher\n",
    "CREATE (p:Person { name: 'Xiaoming' })-[:LIKES]->(h:Hobby { name: '唱歌' })\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 数据载入"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "同学兴趣爱好表格原始数据类似如下:\n",
    "\n",
    "| name | id | hobbies |\n",
    "| :--- | :--- | :--- |\n",
    "| ... | ... | ... |\n",
    "| 小明 | 0000001 | 唱歌 听音乐 看电视剧 看小说 追星 |\n",
    "| 小红 | 0000002 | 音乐，睡觉 |\n",
    "| ... | ... | ... |\n",
    "\n",
    "将表格另存为 CSV 格式后, 可以载入到 Python 中. 这里使用 `csv.DictReader` 将信息读取为字典的形式:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import csv\n",
    "\n",
    "with open('hobbies.csv') as f:\n",
    "    data = [record for record in csv.DictReader(f)]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 内容处理"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "留意到 `hobbies` 字段的文本由多个字段组成，但是每项记录中各字段间采用的分隔符各不相同，可以采用 `re.split()` 方法进行处理。\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "分割字段示例代码: "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import re\n",
    "\n",
    "for record in data:\n",
    "    # 分割字段\n",
    "    record['hobbies'] = re.split('，|、|\\ |；|, |;', record['hobbies'])\n",
    "    # 去除空项\n",
    "    record['hobbies'] = [token for token in record['hobbies'] if len(token) > 0]\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "但留意到存在类似“游戏小说”这样将两项汉语词汇合并写成一个词的, 需要分割开来, 可以使用 `jieba.lcut()` 方法, 参考:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import jieba\n",
    "\n",
    "jieba.lcut('游戏小说')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "显示处理后的 `data`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 提取兴趣爱好关键词"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "总结出爱好中所有出现过的项目, 存储在集合 `hobbies` 中."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "hobbies = set()\n",
    "\n",
    "for record in data:\n",
    "    for token in record['hobbies']:\n",
    "        for word in jieba.lcut(token):\n",
    "            hobbies.add(word)\n",
    "\n",
    "hobbies"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "集合中包含一些动词和一些无意义的词, 这里可以手动去除.\n",
    "\n",
    "先查看有哪些词:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "\n",
    "[word for word in hobbies if len(word) == 1]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "去除一些词:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "hobbies.remove('和')\n",
    "hobbies.remove('与')\n",
    "hobbies.remove('在')\n",
    "hobbies.remove('看')\n",
    "hobbies.remove('想')\n",
    "hobbies.remove('打')\n",
    "hobbies.remove('上')\n",
    "hobbies.remove('听')\n",
    "hobbies.remove('站')\n",
    "hobbies.remove('里')\n",
    "hobbies.remove('刷')\n",
    "hobbies.remove('b')\n",
    "hobbies.remove('一个')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "hobbies"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 将内容写出为 CSV"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "rows = []\n",
    "\n",
    "for person in data:\n",
    "    for token in person['hobbies']:\n",
    "        for item in hobbies:\n",
    "            if token.find(item) != -1:\n",
    "                rows.append({\n",
    "                    'name': person['name'],\n",
    "                    'id': person['id'],\n",
    "                    'likes': item,\n",
    "                })\n",
    "\n",
    "rows"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "with open('hobbies-for-import.csv', 'w') as f:\n",
    "    dict_writer = csv.DictWriter(f, fieldnames=rows[0].keys(), dialect=csv.excel)\n",
    "    dict_writer.writeheader()\n",
    "    dict_writer.writerows(rows)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 导入到 Neo4j 数据库\n",
    "\n",
    "先将 csv 文件放入 Neo4j 安装目录下的 `import` 目录中, 之后便可以在 Cypher 语句中使用 `file:///` 访问文件.\n",
    "\n",
    "```cypher\n",
    "LOAD CSV WITH HEADERS FROM 'file:///hobbies-for-import.csv' AS row \n",
    "MERGE (p:Person {name: row.name, id: row.id}) \n",
    "WITH p, row \n",
    "MERGE (hobby:Hobby {name: row.likes}) \n",
    "MERGE (p)-[r:LIKES]->(hobby)\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 使用 Cypher 在 Neo4j 数据库中查询"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 查询结点\n",
    "\n",
    "查询带有 `Person` 标签, 且属性 `name` 的值为 `小明` 的节点.\n",
    "\n",
    "```cypher\n",
    "MATCH (p:Person) \n",
    "WHERE\n",
    "    p.name='小明' \n",
    "RETURN p\n",
    "```\n",
    "\n",
    "```cypher\n",
    "MATCH (p:Person {name:'小明'}) RETURN p\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 查询指定关系的节点\n",
    "\n",
    "查询拥有共同爱好的人\n",
    "\n",
    "```cypher\n",
    "MATCH (p:Person {name:'小明'})-[:LIKES]->(:Hobby)<-[:LIKES]-(pp:Person)\n",
    "RETURN pp\n",
    "```\n",
    "\n",
    "查询拥有的爱好\n",
    "\n",
    "```cypher\n",
    "MATCH (p:Person {name:'小明'})-[:LIKES]-(hobbies:Hobby)\n",
    "RETURN hobbies\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 查询节点及之间关系\n",
    "\n",
    "查询拥有的爱好\n",
    "\n",
    "```cypher\n",
    "MATCH q=(p:Person {name:'小明'})-[:LIKES]-(hobbies:Hobby)\n",
    "RETURN q\n",
    "```\n",
    "\n",
    "查询拥有共同爱好的人\n",
    "\n",
    "```cypher\n",
    "MATCH q=(p:Person {name:'小明'})-[:LIKES]-(:Hobby)-[:LIKES]-(:Person)\n",
    "RETURN q\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 参考链接\n",
    "\n",
    "- [What is a Graph Database? - Developer Guides - Neo4j](https://neo4j.com/developer/graph-database/)\n",
    "- [Graph Modeling Guidelines - Developer Guides - Neo4j](https://neo4j.com/developer/guide-data-modeling/)\n",
    "- [Getting Started with Cypher - Developer Guides - Neo4j](https://neo4j.com/developer/cypher/intro-cypher/)\n",
    "- [Importing CSV Data into Neo4j - Developer Guides - Neo4j](https://neo4j.com/developer/guide-import-csv/)"
   ]
  }
 ],
 "metadata": {
  "interpreter": {
   "hash": "b89b5cfaba6639976dc87ff2fec6d58faec662063367e2c229c520fe71072417"
  },
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
